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Abstract 

Gaussian process emulators of computationally expensive computer codes provide fast statistical approxima¬ 
tions to model physical processes. The training of these surrogates depends on the set of design points chosen 
to run the simulator. Due to computational cost, such training set is bound to be limited and quantifying the 
resulting uncertainty in the hyper-parameters of the emulator by uni-modal distributions is likely to induce 
bias. In order to quantify this uncertainty, this paper proposes a computationally efficient sampler based on 
an extension of Asymptotically Independent Markov Sampling, a recently developed algorithm for Bayesian 
inference. Structural uncertainty of the emulator is obtained as a by-product of the Bayesian treatment 
of the hyper- parameters. Additionally, the user can choose to perform stochastic optimisation to sample 
from a neighbourhood of the Maximum a Posteriori estimate, even in the presence of multimodality. Model 
uncertainty is also ackno'wledged through numerical stabilisation measures by including a nugget term in 
the formulation of the probability model. The efficiency of the proposed sampler is illustrated in examples 
■where multi-modal distributions are encountered. For the purpose of reproducibility, further development, 
and use in other applications the code used to generate the examples is freely available for download at 
https;//github.com/agarbuno/paims_codes. 
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1. Introduction 

Computationally expensive computer codes are frequently needed to implement mathematical models 
which are assumed to be reliable approximations to physical processes. Such simulators often require intensive 
use of computational resources that makes them inefficient if further expfoitation of the code is needed, e.g. 
optimisation, uncertainty propagation and sensitivity analysis [Forrester et ah, 2008, Kennedy and O’Hagan, 
2001a]. For this reason, surrogate models are needed to perform fast approximations to the output of 
demanding simulators and enable efficient exploration and exploitation of the input space. In this context, 
Gaussian processes are a common choice to build statistical surrogates -also known as emulators- which allow 
to take into account the uncertainty derived from the inability to evaluate the original model in the whole 
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input space. Gaussian processes have become popular in recent years due to their ability to fit complex 
mappings between outputs and inputs by means of a non-parametric hierarchical structure. Such applications 
are found, amongst many other areas, in Machine Learning [Rasmussen and Williams, 2006] , Spatial Statistics 
[Cressie, 1993] (with the name of Kriging), likelihood-free Bayesian Inference [Wilkinson, 2014] and Genetics 
[Kalaitzis and Lawrence, 2011]. 

To build an emulator, a number of runs from the simulator is needed, but due to computing limitations 
only a small amount of evaluations can be performed. With a small amount of data, it is possible that the 
uncertainty of the parameters of the model cannot be described by a clearly uni-modal distribution. In such 
scenarios. Model Uncertainty Analysis [Draper, 1995] is capable of setting a proper framework in which we 
acknowledge all uncertainties related to the idealisations made through the modelling assumptions and the 
available, albeit limited information. To this end, hierarchical modelling should be taken into account. This 
corresponds to adding a layer of structural uncertainty to the assumed emulator either in a continuous or 
discrete manner [see Draper, 1995, §4]. In the case of Gaussian processes, continuous structural uncertainty 
can be accounted for as a natural by-product from a Bayesian procedure. Hence, this is pursued in this work 
by focusing on samplers capable of exploring multi-modal distributions. 

In order for the Gaussian process to be able to replicate the relation between inputs and outputs and 
make predictions, a training phase is necessary. Such training involves the estimation of the parameters of 
the Gaussian process from the data collected by running the simulator. These parameters are referred to 
as hyper-parameters. The selection of the hyper-parameters is usually done by using Maximum Likelihood 
estimates (MLE), or their Bayesian counterpart Maximum a Posteriori estimates (MAP) [Oakley, 1999, 
Rasmussen and Williams, 2006], or by sampling from the posterior distribution [Williams and Rasmussen, 
1996] in a fully Bayesian manner. 

In this paper we assume a scenario where the task of generating new runs from the simulator is prohibitive. 
Such limited information is not enough to completely identify either a candidate or a region of appropriate 
candidates for the hyper-parameters. In this scenario, traditional optimisation routines [Nocedal and Wright, 

2004] are not able to guarantee global optima when looking for the MLE or MAP, and a Bayesian treatment is 
the only option to account for all the uncertainties in the modelling. In the literature, however, it is common 
to see that MLE or MAP alternatives are preferred [Kennedy and O’Hagan, 2001a, Gibbs, 1998] due to the 
numerical burden of maximising the likelihood function or because it is assumed that Bayesian integration 
will not produce results worth the effort. Though it is a strong argument in favour of estimating isolated 
candidates, in high-dimensional applications it is difficult to assess if the number of runs of the simulator is 
sufficient to produce robust hyper-parameters. Robustness is usually measured with a prediction-oriented 
metric such as root-mean-square error (RMSE) [Kennedy and O’Hagan, 2001b], ignoring uncertainty and 
risk assessment of choosing a single candidate of the hyper-parameters by an inference process with limited 
data. In order to account for such uncertainty in the hyper-parameters when making predictions, numerical 
integration should be performed. However, methods as quadrature approximation become infeasible as the 
number of dimensions increases [Kennedy and O’Hagan, 2001a]. Therefore, an appropriate approach is to 
perform Monte Garlo integration [MacKay, 1998]. This allows to approximate any integral by means of a 
weighted sum, given a sample from the correct distribution. 

In Gaussian processes, as in many other applications of statistics, the target distribution of the hyper¬ 
parameters cannot be sampled directly and one should resort to Markov Chain Monte Carlo (MCMC) methods 
[Robert and Casella, 2004]. MCMC methods are powerful statistical tools but have a number of drawbacks if 
not tuned properly, particularly if one wishes to sample from multi-modal distributions [Neal, 2001, Hankin, 

2005] . One of such limitations is the tuning of the proposal distribution, which allows the generation of a 
candidate in the chain. This proposal function has to be tuned with parameters that define its ability to 
move through the sample space. If an excessively wide spread is selected, this will produce samples with 
space-filling properties but which are likely to be rejected. On the other hand, having a narrower spread 
will cause an inefficient exploration of the sample space by taking short updates of the states of the chain, 
known in the literature as Random Walk behaviour [Neal, 1993]. In practice it is desirable to use a proposal 
distribution which is capable of balancing both extremes. To find an appropriate tuning in high-dimensional 
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spaces with sets of highly correlated variables can be an overwhelming task and often MCMC samplers can 
become expensive due to the long time needed to reach stationarity [Ching and Chen, 2007]. Neal [1998] 
and Williams and Rasmussen [1996] favour the Hybrid Monte Carlo (HMC) method to generate a sample 
from the posterior distribution, preventing the random walk behaviour of traditional MCMC methods. If 
tuned correctly, the HMC should be able to explore most of the input space [Liu, 2008] . Such tuning process 
is problem-dependent and there is no guarantee that the method will sample from all existing modes, thus 
failing to adapt well to multi-modal distributions [Neal, 2011]. 

This paper proposes a sampler for the hyper-parameters of a Gaussian process based on recently developed 
methods for Bayesian inference problems. Additionally, it uses the Transitional Markov Chain Monte Carlo 
(TMCMC) method of Ching and Chen [2007] to set a framework for the parallelisation of Asymptotically 
Independent Markov Sampling in both the context of hyper-parameter sampling (AIMS) [Beck and Zuev, 
2013] and in stochastic optimisation (AIMS-OPT) [Zuev and Beck, 2013] reminiscent of Stochastic Subset 
Optimisation [Taflanidis and Beck, 2008a,b]. Such an extension is built using concepts of Particle Filtering 
methods [Andrieu et ah, 2010, Gramacy and Poison, 2009], Adaptive Sequential Monte Carlo [Del Moral 
et ah, 2006, 2012] and Delayed Rejection Samplers [Zuev and Katafygiotis, 2011, Mira, 2001]. AIMS is chosen 
since it provides a framework for Sequential Monte Carlo sampling [Neal, 1996, 2001, Del Moral et al., 2006] 
which automatically chooses the sequence of transitions. Moreover, it uses most of the information generated 
in the previous step in the sequence as opposed to traditional sequential methods, thus building a robust 
sampler when applied to multi-modal distributions. Finally, by using the AIMS-OPT algorithm a solution is 
built by means of a nested sequence of subsets, which converges to the optimal solution set. The algorithm 
can be terminated prematurely given a previously chosen accuracy threshold, thus providing a set of nearly 
optimal solutions. Whether it is composed by a single element, or a set of elements whose objective function 
differs by a negligible quantity, a full characterisation of the optimal solution is achieved. This contrasts 
with the capabilities of other stochastic optimisation schemes such as particle swarm optimisation or genetic 
algorithms [Schneider and Kirkpatrick, 2007]. 

By selecting the hyper-parameters using the AIMS-OPT framework the effect is twofold. First, the 
uncertainty inherent to the specification of the hyper-parameters is embedded in the set of suboptimal 
approximations to the solution. This uncertainty, expressed in a mixture of Gaussian process emulators, 
yields a robust surrogate where model uncertainty is accounted for. Second, computational implementation 
deficiencies of the inference procedure in Gaussian processes is overcome by incorporating stabilising approaches 
exposed in the literature as in Ranjan et al. [2011], Andrianakis and Challenor [2012] but in a Bayesian 
framework. The problem is therefore treated from both a probabilistic and an optimisation perspective. 

The paper is organised as follows. In Section 2, a brief introduction to the Gaussian processes and 
their treatment by Bayesian inference is discussed. Section 3 presents both the AIMS algorithm and the 
proper generalisation for a parallel implementation. Section 4 discusses several aspects of the computational 
implementation of the algorithm and their effect on the modelling assumptions. The efficiency and robustness 
of the proposed sampler are discussed in Section 5 with some illustrative examples. Concluding remarks are 
given in Section 6. 

2. Gaussian processes 

Let X = {xi,..., x„} be the set of trials run by the simulator where x^ S denotes a given configuration 
for the model. The set X will be referred to as the set of design points . Let y = {yi, ...,?/„} be the set of 
outputs observed for the design points. The pair (xi,yi) will denote the training run being used to learn the 
emulator that approximates the simulator. The emulator is assumed to be a real-valued mapping rj : —>■ M 
which is an interpolator of the training runs, i.e. yi = ? 7 (xi) for all z = 1,..., n. This omits any random error 
in the output of the computer code in the observed simulations, that is, the simulator is deterministic. It is 
assumed that the output of the simulator can be represented by a Gaussian process. Therefore, the set of 
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design points is assumed to have a joint Gaussian distribution where the output satisfies the structure 

77(x) =/i(x)^/3 + Z(x|cr^c/)), (2.1) 

where h{-) is a vector of known basis (location) functions of the input, /3 is a vector of regression coefficients, 
and ,cj)) is a Gaussian process with zero mean and covariance function 

cov(x, x'|ct^, (f>) = fc(x, x'l^), (2.2) 

where tr^ is the signal noise and cp G denotes the length-scale parameters of the correlation function k(-, ■). 
Note that for a pair of design points(x,x'), the function k(-, •|</)) measures the correlation between 7f(x) and 
? 7 (x') based on their respective input configurations. The effect of different values of (f> in a one-dimensional 
example is depicted in Figure 1. 



Inputs X 

Figure 1: The length-scale parameters represent how sensitive is the output of the simulator to variations in each 
dimension. The plot corresponds to 8 design points chosen for the function t]{x) = 5-|-a:-|-cos(r)-I-.5 sin(3r). 
For low values of the length-scale parameter the training runs are less dependent of each other. 


The role of the correlation function is to measure how close to each other the design points are, following 
the assumption that similar input configurations should produce similar outputs. For its analytical simplicity, 
interpretation and smoothness properties, this work uses the squared-exponential correlation function, namely 


fc(x, x'|0) = exp 



i=\ 



(2.3) 


Note that other authors prefer the parametrisation with (ff as denominators. However, this work uses a linear 
term in the denominator since the restriction of the length-scale parameters to lie in the positive orthant is 
more natural, as weights in the norm used to measure closeness and sensitivity to changes in such dimensions. 
Both interpretability and numerical performance can be improved if the length-scales refer to the same units, 
which leads to rescaling all dimensions of the input configurations. In the computer simulation terminology 
this translates in utilising experimental designs restricted to hypercubes, such as Latin hypercube sampling or 
Sobol sequences. Design of experiments is an active area of research outside the scope of this work. 

In summary, the output of a design point, given the parameters /3, and 0, has a Gaussian distribution 

y\x, /3, cr^ cf) ~ U{h{xf (3, cr^ fc(x, x'|(/))), (2.4) 


which can be rewritten as the joint distribution of the vector of outputs y conditional on the design points X 
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and hyper-parameters /3, and (p as 


y\X,(3,a\ct)r^M{H(3,a^K), (2.5) 

where H is the design matrix whose rows are the inputs h{xi)'^ and K is the correlation matrix with elements 
Kij = k{xi,Xj\cj)) for all i,j = 1,...,n. 

2.1. Estimating the hyper-parameters 

The parameters of the process are not known beforehand and this induces uncertainty in the emulator 
itself. They can be estimated by Maximum Likelihood principles, but doing so lacks rigorous uncertainty 
quantification by concentrating all the density of the unknown quantities in a single value. The alternative 
is to treat them in a fully Bayesian manner and marginalise them when performing predictions. This way 
their respective uncertainty is taken into account. In this scenario, the prediction y* for a non-observed 
configuration x* can be performed with the data available, T) — (y,^), and the evidence they shed on the 
parameters of the Gaussian process. Therefore, the predictions should be made with the marginalised posterior 
distribution 


p{y*\^\v)= / p[f\^*,v,e)p{e\v)de, ( 2 . 6 ) 

J0 

where 9 = {(3,a^,cj)) denotes the complete vector of hyper-parameters. One should note that given the 
properties of a collection of Gaussian random variables, a prediction for y* conditioned in the data and 6 is also 
a Gaussian random variable [see Oakley, 1999]. As in hierarchical modelling, each possible value of 9 defines 
a specific realisation of a Gaussian distribution, so it is appropriate to refer to 9 as the hyper-parameters of 
the Gaussian process. 

Due to its computational complexity, the integral in (2.6) is often omitted when making predictions. It is 
commonly assumed that the MLE of the likelihood 

>C(0) = p{y\X, f3, cr^ </)), (2.7) 

or the MAP estimate from the posterior distribution 

pi9\V) oc p(y I A, (3, (p) p{(3, a'^,cp), (2.8) 

are robust enough to account for all the uncertainty in the modelling. However, when either the likelihood 
(2.7) is a non-convex function or the posterior (2.8) is a multi-modal distribution, conventional optimisation 
routines might only find local optima, thus failing to find the most probable candidate of such distribution. 
Moreover, by selecting only one candidate, robustness and uncertainty quantification are lost in the process. 
Additionally, there are degenerate cases when it is crucial to estimate the integral in (2.6) by means of 
Monte Garlo simulation instead of by proposing a single candidate. As it has been noted by Andrianakis and 
Ghallenor [2012], two extreme cases for the Gaussian process length-scale hyper-parameters can be identified. 
One possibility is for p to approach infinity, which makes every design point dependent of each other; the 
other, when p approaches the origin where a multivariate regression model becomes the limiting case. In 
the first case, high correlation among all the training runs results in a model which is not able to distinguish 
local dependencies. As for the second, it violates the assumptions that constitute a Gaussian process, by 
completely ignoring the correlation structure in the design points to predict the output. Consequently, if 
MCMC is performed one can approximate the integrated predictive distribution in (2.6) by means of 

N 

p(y*|x*,X>) « '^Wi p{y*\x*,V,9P, (2.9) 
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where 6i is obtained through an appropriate sampler, i.e. one capable of sampling from multi-modal distri¬ 
butions. The coefficients Wi denote the weights of each sample generated. Since each term p( 2 /*|x*, P, 0^) 
corresponds to a Gaussian density function, the predictions are made by a mixture of Gaussians. 

Proposition 1. If the emulator output y* conditional on its configuration vector x* has a posterior density 
as in (2.9), then its mean function and covariance function can be computed as 

N 

N 

cov(x*,x') = '^Wt [{fj.i{x*) - fj.{x*)){fj,i{x') - y{x')) + co?;(x*,x'|0j)], 

i=l 

where p,i{x*) denotes the expected value of the likelihood distribution of y* conditional on the hyper-parameters 
6i, the training runs D and the input configuration x*. 

Proof. Equality in (2.10) is a direct application of the tower property of conditional expectation and (2.11) 
follows from the covariance decomposition formula using the vector of weights wi as an auxiliary probability 
distribution on the conditioning. ■ 

From equation (2.11) we can compute the variance, also known as the prediction error, of an untested 
configuration x* as 


( 2 . 10 ) 

( 2 . 11 ) 


N 

var(x*) = {{y^{x*) - ^l{x*) f + a^{x*)). (2.12) 

By doing this, a more robust estimation of the prediction error is made since it balances the predicted error 
in one sample with how far the prediction of such sample is from the overall estimation of the mixture. 

2.2. Prior distributions 

In order to perform a Bayesian treatment for the prediction task in equation (2.6) the prior distribution 
p(/ 3,(T^,^) in equation (2.8) has to be specified. Weak prior distributions are commonly used for j3 and cr^ 
[Oakley, 1999]. Such weak prior has the form 

p(^,c^^(/)) cx (2.13) 

where it is assumed a priori that both the covariance and the mean hyper-parameters are independent. Even 
more, j3 and cr^ are assumed to have an improper non-informative distribution. 

As for the length-scale hyper-parameter cf, a prior distribution p{4>) is still needed. In this case the 
reference prior [studied by Berger and Bernardo, 1992, Berger et ah, 2009] sets an objective framework to 
account for the uncertainty of </>, thus avoiding any potential bias induced by the modelling assumptions. 
This prior is built based on Shannon’s expected information criteria and allows the use of a prior distribution 
in a setting where no previous knowledge is assumed. That way, the training runs are the only source of 
information for the inference process. Additionally, the reference prior is capable of ruling out subspaces 
of the sample space of the hyper-parameters [Andrianakis and Challenor, 2011], thus reducing regions of 
possible candidates of Gaussian distributions in the mixture model in equation (2.9). Since this provides an 
off-the-shelf framework for the estimation of the hyper-parameters, the reference prior developed by Paulo 
[2005] is used in this work. However, there are no known analytical expressions for its derivatives which limits 
its application to MGMG samplers that use gradient information. Note that there are other possibilities 
available for the prior distribution of (f>. Examples of these are the log-normal or log-Laplacian distributions, 
which can be interpreted as a regularisation in the norm of the parameters. Andrianakis and Challenor [2011] 
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suggest a decaying prior. Another option is to elicit prior distributions from expert knowledge as in Oakley 

[ 2002 ], 

2.3. Marginalising the nuisance hyper-parameters 

The nature of the hyper-parameters /3, cr^ and </> is potentially different in terms of scales and dynamics, 
as seen and explained in Figure 2. It is possible to cope with this limitations by using a Gibbs sampling 
framework, but it is well-known that such sampling scheme can be inefficient if it is used for multi-modal 
distributions in higher dimensions. Analogously, a Metropolis-Hastings sampler can also be overwhelmed. 


Number of training runs = 18 




(a) Correlations (b) Scales 

Figure 2: In 2(a), different dynamics of the hyper-parameters for the log-posterior distribution of test function 5.1 are 
shown: A. corresponds to positive correlation. B. corresponds to an independent region. C. corresponds 
to negative correlation. In 2(b), the marginal log-posterior function of the same example with h{x) = 1, 
presents the same contour level for a wide range of (3. Thus, the hyper-parameters exhibit very different 
scales. The dot represents the minimum of the corresponding function. 


Another alternative is to focus on (p and perform the inference in the correlation function. This is done 
by regarding /3 and as nuisance parameters and integrating them out from the posterior distribution 
(2.8). The modelling assumptions in the training runs and the prior distribution, equations (2.5) and (2.13) 
respectively, allow to identify a Gaussian-inverse-gamma distribution for (3 and cr^, which can be shown to 
yield the integrated posterior distribution 

p{cl)\V) cx p{(p) (ct^)”"^ (2.14) 

where 


and 

^ (2.16) 

are estimators of the signal noise and regression coefficients (3 [see Oakley, 1999, for further details]. 
Additionally, the predictive distribution conditioned on the hyper-parameters follows a Gaussian distribution 
with mean and correlation functions 

p(sT\cP) = h{K*f^ + t{K*fK-\y - H^), (2.17) 

corr(x*, w*|^) = fc(x*, w*|^) — t{'x*)'^K~^ + 

(h(x*)^ - (/i(w*)^ - , (2.18) 


d2 = 


n — p — 2 


(2.15) 
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where x*, w* denote a pair of test configurations and t(x*) denotes the vector obtained by computing the 
covariance of the new proposal with every design point t(x) = (fc(x, Xi|</>), ... fc(x, x„|</>))^. Note that both 
estimators depend only on the correlation function hyper-parameters (p since both (3 and cr^ have been 
integrated out. Considerations of when it is appropriate to integrate out the hyper-parameters in a model has 
been discussed by MacKay [1996]. In the Gaussian process context it gains additional significance since it 
allows the development of appropriate MCMC samplers capable of overcoming the dynamics of different sets 
of hyper-parameters. 

In the light of the above discussion, this work focuses on the inference drawn from the correlation function 
fc(-, •) in equation (2.2), since the structure of dependencies of the training runs to predict the outputs is 
recovered by it. The main assumption is that the mean function hyper-parameter /3 contains minor information 
on the structural dependencies of the data, relative to the correlation function hyper-parameters, which would 
prevent the use of integrated likelihoods [see Berger et ah, 1999, for further discussion]. If prior information is 
available, then an additional effort can be made on eliciting an appropriate mean function for the Gaussian 
process emulator. Such information can be related to expert knowledge of the simulator which eventually 
allows the analyst to build a better mean function by adding significant regression covariates [see Vernon 
et al., 2010, for a detailed discussion]. 


3. AIMS Framework 

Hyper-parameter marginalisation by means of Monte Garlo methods in Gaussian processes is usually 
performed by Hybrid Monte Carlo methods [Neal, 1998, Williams and Rasmussen, 1996] which are capable of 
suppressing the Random Walk behaviour of MCMC samplers if tuned correctly. In this work, the sampling of 
the hyper-parameters is done by means of Asymptotically Independent Markov Sampling (AIMS) [Beck and 
Zuev, 20I3[. This method combines techniques developed for Bayesian inference such as Importance Sampling 
and Simulated Annealing [Kirkpatrick et ah, 1983] to sample from the posterior distribution as done by other 
MCMC algorithms. Additionally, AIMS can also be adapted for global optimisation (AIMS-OPT) [Zuev and 
Beck, 2013] in a fashion of the traditional simulated annealing method for stochastic optimisation. Let the 
problem be 


min 'H((/)|I?), (3.1) 

where 7^(011?) denotes the negative log-posterior distribution conditional on the set of training runs T). Let 
the set of optimal solutions to the optimisation problem above be 

<1>* = |</>S<1> : cj) = argmin 'H{cp\D) 1 , (3.2) 

( </>6* J 

where |<I’*| > 1. This formulation acknowledges the presence of multiple global optima in the posterior 
distribution conditional on the training runs. It is important to note that using the logarithm of the posterior 
distribution reduces the overflow in the computation of the equation (2.14), which is likely to arise due to 
ill-conditioning of the matrix K [Neal, 2003]. 

In this context, AIMS-OPT is capable of producing samples by means of a sequence of nested subsets 
^/c+i C that converges to the set of optimal solutions $*. Thus, if the algorithm is terminated in a 
premature step, a set of sub-optimal approximations to (3.2) will be recovered. Let {pfe(</>|T>)}^i be the 
sequence of density distributions such that 

Pfe(</>|X>) (X p(</)|X>)^/^'= = exp {-'H{cj)\D)/Tk} , (3.3) 

for a sequence of monotonically decreasing temperatures r^. By tempering the distributions in this manner, 
the samples obtained in the first step of the algorithm are approximately distributed as a uniform random 
variable over a practical support; while in the last annealing level, they are distributed uniformly on the set 



of optimal solutions, namely 


(3.4) 

(3.5) 


lim Pr{4>\V) = U^{4>), 

T—¥OCi 

lim Pr(0|2?) = U^^iyCj)), 

where Ua{4>) denotes a uniform distribution over the set A for every cf) G A. 

3.1. Annealing at level k 

The general framework for the AIMS-OPT algorithm is presented, focusing on how to sample from the 
hyper-parameter space at level k based on the sample of the previous level. Let 0^^ ^\ ..., be samples 

of the hyper-parameters distributed as pk-i{<p) at level k — 1. For notational simplicity, the conditional on 
V will be omitted from pk-i{‘), however the training runs are crucial to build statistical surrogates. The 
objective is to use a kernel such that pfc(-) is the stationary distribution of the Markov chain. Let Vk denote 
such Markov transition kernel, which satisfies the continuous Chapman-Kolmogorov equation 

Pk{(t>)d(f)^ [ Vkid4>\^) pk{^) d^, (3.6) 

where Pk{d(p) = Pki(f>) dcj) denotes the probability measure. By applying importance sampling using the 
distribution at the previous annealing level, equation (3.6) can be approximated as 

Pk{(t>)d(f)= [ Vk{dcf)\^) pfc-i(|)d| 

J<s> Pfe-i(l) 

N 

~ = Pk,N{d4>), (3.7) 

j=i 

where Pk,N{') is used as the global proposal distribution for a candidate in the chain and 



are the importance weights and the normalised importance weights respectively. Note that for computing 
the normalising constant of the integrated posterior distribution (2.14) is not needed. 

The proposals of candidates for the chain are done in two steps. In the first step, a candidate is drawn 
as an update from a random marker from the sample of the previous annealing level, checking whether it is 
accepted or not. If the local candidate is rejected by a Random Walk Metropolis-Hastings evaluation, then 
the chain remains invariant, and another marker is selected at random. In the second step, given 

the candidate has been accepted as a local proposal, such candidate is considered as being drawn from the 
approximation in (3.7) and accepted in an Independent Metropolis-Hastings framework, hence called a global 
candidate for the chain. Let qk{'\') denote the symmetric transition distribution used for local proposals for 
the Markov chain. The subscript k accounts for the adaptive nature of the transition steps in each annealing 
level. Thus, the kernel distribution of the Random Walk, which leaves the intermediate density invariant, can 
be written as 


'Pkidcj)\$,) =qk{4>\^) min<^ 1 


Pk{i) 


dcj) + {I- afc(l)) 6^{dcj)), 


(3.10) 
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where (5^(d0) denotes a delta density and ak{^) is the probability of accepting the transition from ^ to 
It follows from (3.7) that the approximated stationary condition of the target distribution at annealing level k 
can be written as 


N 


Pk,N{(f>) = ^ 


w) Qk 




(P) afe 


(0 




(fe-i) 


with 


“fe (^10) =min<^ 1 


Pk{$,) 


' Pk (^) 

the probability of accepting the local transition; whereas 

Pk{OPk,N{(f}) 


(I l</>) =niin<j 1, 


Pk (0) Pk,Nii) J ’ 


(3.11) 


(3.12) 


(3.13) 


denotes the probability of accepting such candidate for the Markov chain, hence accepting a global transition 
[see Zuev and Beck, 2013, for a detailed discussion]. This leads to the following two algorithms for each level 
in the annealing sequence. 

Algorithm 1: AIMS-OPT at annealing level k 

Input 

o 0^^ , (t>N Pk-ii<P)t generated at previous level; 

o G $\ • I ii^iti^-l state of the chain; 

o gfc(0|4), symmetric local proposal; 

Output: 

for i ^ 2 to n — 1 do 

(1) Generate a local candidate using the previous level samples as “markers” 




N 


= E 


rrlfc-l) 






(3.14) 


(a) Select index j with probability proportional to importance weights wj* 

(b) Generate candidate from the local proposal distribution 


I ~ qk {$. 




(fe-i) 


(c) Accept ^ as a local candidate with probability 






(k-i) 


(3.15) 

(3.16) 


(2) Update —>■ accepting or rejecting ^ using Algorithm 2. 


end 
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Algorithm 2: Global acceptance of ^ 


if 4 was accepted as local candidate then 

Accept ^ as a global transition with probability 

else 

Leave the chain invariant 



4> 


(k) _ 
i+i “ 



end 


(3.17) 

(3.18) 


According to Algorithm 1 the initialising step should also be provided for the annealing level. In 
practical implementations it is suggested that it should be considered to be ~ where 

j = argmaxi i.e. the sample with the largest normalised importance weight. 

3.2. Adaptive proposal distribution and temperature seheduling 

Even though a Random Walk is performed in every local proposal, AIMS-OPT performs efficient sweeping 
of the sample space by producing candidates from neighbourhoods of the markers from the previous annealing 
level {<Pj^ This is accomplished if the transition distribution ^^) uses an appropriate proposal 

distribution where sampling is to be realised; namely, the level curves of the tempered distribution. To be 
able to cope with the non-negative restriction and to neglect the effect of the scales on each dimension, the 
transitions are performed in the log-space of the length-scale parameters cp, as suggested by Neal [1997]. The 
symmetric transition distribution proposed is a Gaussian distribution for such log-parameters. That is, each 
local candidate will be distributed as 




cpf-^\cA 


)■ 


(3.19) 


where is a decaying parameter for the spread of the proposal, i.e. = vck-i with v G ( 0 , 1 ) commonly 
chosen as = 1/2 [Zuev and Beck, 2013]. The matrix Efc denotes the covariance matrix for log-parameters 
where typical choices can be the identity matrix Ipxp, a diagonal matrix or a symmetric positive definite 
matrix. We propose the use of the weighted covariance matrix estimated from the sample and their importance 
weights of the previous level ..., By doing so, the scale and directions of 

the ellipsoids of the Gaussian steps are learned as in Adaptive Sequential Monte Carlo methods [Haario et ah, 
2001, Fearnhead and Taylor, 2013] from the information gathered from the previous level in the sequence. 

The annealing sequence and its effective exploration of the sample space is dictated by the temperature 
Tk of the intermediate distributions. Moreover, it defines how different is one target distribution from the 
next one, so the effectiveness of the sample as markers from the previous annealing level depends strongly on 
how the scheduling is performed. It is clear that abrupt changes lead to rapid deterioration of the sample, 
whilst low paced changes could produce unnecessary steps in the annealing schedule. In order to cope with 
this compromise, Zuev and Beck [2013] used the effective sampling size to determine the value of the next 
temperature in the process. That is solving for Tk, when 

(E"=.exp{-«(.Af-'>)(,. 

where 7 defines a threshold for the proportion of the sample to be as effective from the importance sampling. 
Note that the value of 7 defines additionally how many annealing steps will be performed. As suggested from 
Zuev and Beck [2013] a value of 1/2 is used for such parameter. 


a sample from level k — 1 has been produced, in 

(--—)] 

\rk Tfc_i J J 




1 

771 ’ 


(3.20) 
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3.3. Stopping condition 

If the temperature continues to drop along the sequence of intermediate distributions, eventually an 
absolute zero = 0 would be reached. However, such limit cannot be achieved in practical implementations 
and a stopping condition is needed for the algorithm. By the same assumptions as in the original paper [Zuev 
and Beck, 2013] and without loss of generality, the objective function 'H{4>) is assumed to be non-negative. 
Similarly, let 5k denote the Coefficient of Variation (COV) of the sample i.e. 

(3.21) 

Therefore, 5k is used as a measure of the sensitivity of the objective function to the hyper-parameters in 
the domain d)*^. If the samples are all located in then their COV will be zero, since Vj = 

min(^g$* As the progression of the intermediate distributions advances with k, it is expected that 

5k —> 0. As a consequence, a criteria to stop the annealing sequence is needed, and the algorithm will stop 
when the following condition is attained 



[n 

1 - ; 

1 

N l^j-- 


(.-)) 


1 

N 


1^1 

W) 

1 


5k <0^0 — ^target; (3.22) 

where a is assumed to be 0.10 in practical implementations to prevent the algorithm to generate redundant 
annealing levels in the last steps of the procedure. Note that the stopping criterion (3.22) is used to drive 
the simulated annealing temperature towards the absolute zero. However, if the aim is not localising modes 
as in stochastic optimisation, and a more traditional oriented sampling is required, the algorithm could be 
truncated in a temperature value of 1. This adds an additional layer of flexibility to the algorithm which 
other stochastic-search approaches do not share. 

3 . 4 . Parallel implementation and guarding against rejection 

As found in our earliest experiments, AIMS-OPT with the global acceptance rule as in Algorithm 2 might 
degenerate quickly in higher dimensions since the starting of the chain comes from the highest normalised 
weighted sample and a transition might take too long to be performed, resulting in high rejection rates. 
Furthermore, information from the markers is lost since they do not provide good transition neighbourhoods 
and the ability to create new samples for the next annealing level is maimed. This aside, AIMS-OPT can 
become computationally expensive when the number of samples increases. To cope with these limitations we 
propose to incorporate the Transitional Markov Chain Monte Carlo (TMCMC) and the Delayed Rejection 
methods into the AIMS-OPT framework. This extension not only enhances the mixing properties of the 
sampler, i.e. improve acceptance rates, but also provides a computational framework in which parallel Markov 
chains can be sampled from the intermediate distributions Pk{(t>) of fho length-scale hyper-parameters. 

The idea to enable parallelisation comes from the TMCMC algorithm [see Ching and Chen, 2007, for 
further details]. In the framework of Algorithm 1, every marker from the annealing level fc — 1 is a starting 
point for a Markov chain. This produces not only specialised chains which are likely to explore the marker’s 
neighbourhood on the sample space, but also allows an assessment of which markers will generate a better 
chain. The normalised weights will dictate how deeply a chain will evolve starting from its marker . 

Consequently, the number of samples in each chain will be set with probability proportional to the normalised 
weight, a direct result from the TMCMC algorithm. 

In order to guard against high rejection rates, and therefore degeneracy on the sampling scheme, we 
propose to generate an additional candidate if the first one is rejected as in Delayed Rejection Algorithms 
[Mira, 2001]. Let iS'i(-|-), S' 2 ('|’) •) be a one step and two steps proposal density distributions respectively; 7r(-) 
the target distribution of the Markov chain and ai(-, •) the probability of accepting a transition in one step. 
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Then, the probability of accepting a transition in two steps, denoted by a 2 (-, •), is 


a2{<t>o,4>2) = min 


T^{4’2) ‘S'l (^1 |</>2) ^2,{4’q\4^2^ 4>i) Qi( 02; ^l)) 1 

’ 7’'((/)o)S'i(^i|</>o)S'2(02l^O)‘/'i) (1 - a.i{(t>oAi)) I ’ 


(3.23) 


where (pQ denotes the starting point, cf>i the rejected candidate and 02 the second stage candidate. In our 
context, the target distribution 7r(-) is each annealing level Pk{’) density distribution, the one step proposal 
distribution 5'i is the independent approximation in equation (3.11) and the one-step acceptance probability 
is the global acceptance probability in (3.13). The two-step proposal density S2 can be chosen from several 
alternatives. In this work we use a symmetric distribution centred at the starting point 0 q, since it can be 
seen as a back-guard against Si being a deficient independent sampler [see Zuev and Katafygiotis, 2011, for a 
detailed discussion]. Therefore, the previous equation can be rewritten in compact form as 


afc.2(0o> ^ 2 ) = min 


, Pki(t>2) - <^U^l\(t>2)) ] 
’p,(0o)(l-a«(0J0o))i ’ 


(3.24) 


where af(’|’) is defined as in equation (3.13). The fact that S2 is a symmetric distribution centred in the 
starting point 0o has been used, S' 2 ( 02 l</'o>‘/’i) = 5 (</> 2 l‘/’o) = ff(^ol</’ 2 ) = ‘S' 2 ( 0 ol‘/’ 2 : 0i)) where 5 (-|-) 
denotes such symmetric proposal density. By performing the second stage proposal, the stationary condition 
of Pki’) is maintained as stated in the following proposition. 

Proposition 2. AIMS-OPT coupled with delayed rejection in two stages leaves the target distribution pk{’) 
invariant at each annealing level. 

Proof. See Appendix A for a proof using a general transition distribution S' 2 ( j-, •). I 

From the above discussion, the proposed scheme provides a fail-safe against any possible mismatch of the 
approximation done with (3.11). Additionally, the results presented in this paper correspond to the second 
step candidate being a Gaussian random variable, ^ ~ A/'(0 [^^[cqS/;). The ideas to accept a global transition 
after having accepted a local proposition can be summarised in Algorithm 3. 

Algorithm 3: Global acceptance using delayed rejection 

if ^ was accepted as local candidate then 

Accept ^ as a global transition with probability 




(3.25) 

Generate a second candidate ^2 from 



^2 

~Af(0f IcoSfc) 

(3.26) 

if ^2 is accepted with probability 0 ^, 2 (0,-^^ 

1 ^ 2 ) computed as in equation (3.24) then 


else 

4% = ^2 

(3.27) 

end 


(3.28) 


end 
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4. Implementation Aspects 

The computational complexity of the posterior distribution in equation (2.14) is governed by the inverse 
of the covariance matrix K as it scales with the number of training runs N. Several solutions have been 
developed in the literature, such as computation of inverse products of the form K~^u, with u G by means 
of Cholesky factors or Spectral Decomposition [see Golub and Van Loan, 1996, for efficient implementations] 
to preserve numerical stability in the matrix operations [see Gibbs, 1998] . Nonetheless, numerical stability is 
not likely to be achieved if the training runs are very limited, or if the sampling scheme for such training runs 
cannot lead to stable covariance matrices, as depicted in Figure 3. 



10 '® 10 ° 10 ^ 

A 

(a) Numerically unstable surface 


(b) Numerically stable surface 


Number of training runs = 18 


Number of training runs = 18 


Figure 3: 


Projection of the negative log-posterior curves in the two dimensional length-scale space. Adding the nugget 
(j)s results in a numerically stable surface. 


To overcome this practical deficiency, a correction term in the covariance matrix can be added in order to 
preserve diagonal dominancy, that is, we add a nugget hyper-parameter (pg to the covariance such that 

Ks = K + cPsI, (4.1) 

is positive definite. Doing so results in the stochastic simulator 

yi = igiK,)-\-a'^ cpg. (4.2) 

Note that the interpolating quality of the Gaussian process is lost, however, the term tr^ cpg accounts for 
the variability of the simulator that cannot be explained by the emulator given the original assumptions 
(adequacy of the covariance function, for example). The nugget can also provide further quantification of 
model uncertainty in the inference process as it provides an alternative to smoothing an already complex 
surface. As it is also noticed by Andrianakis and Challenor [2012] and Ranjan et al. [2011], the quality 
of the emulator changes with the inclusion of the nugget, since it modifies the objective function itself by 
introducing new modes in the landscape of the posterior distribution. The configuration reflected by new 
modes in these cases might correspond to emulators with no local dependencies and an overall simple trend, 
defined from the basis functions and regression hyper-parameters (3. Therefore, if a Gaussian process with 
no local dependencies, e.g. with its mode farther away from the origin in the length-scale space, is assessed 
as not appropriate for the model, a regularisation term can be added in the optimisation formulation as in 
[Andrianakis and Ghallenor, 2012] . This corresponds to precautions for the inclusion of the nugget and can 
be seen as elicited prior beliefs on the Bayesian formulation. However, by using a multi-modal sampler for 
stochastic optimisation as the one proposed, a robust emulator capable of mixing various possibilities can 
be provided. This results in an emulator that is able to cope with violations to the modelling assumptions 
originated by working with a limited amount of training runs. 

We incorporate the nugget term cpg as a hyper-parameter of the correlation function in the Bayesian 
inference process. As suggested by Ranjan et al. [2011] a uniform prior distribution 1) for such 
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parameter is considered. The effect of the bounds is twofold. First, the lower bound is used to guarantee 
stability in the covariance matrix. Second, the upper bound is used to force the numerical noise of the 
simulator to be smaller than the signal noise of the emulator itself. Note that this last assumption can be 
omitted if the problem requires it. By considering the correlation matrix as in equation (4.1), this yields 

= (4.3) 

where Ks denotes the corrected correlation matrix and has been used to denote the covariance matrix of 
the Gaussian process. By doing so it is clear that previous considerations regarding cr^, such as the ability of 
marginalising it as a nuisance parameter and the use of a non-informative prior remain unchanged [De Oliveira, 
2007]. 


5. Numerical Experiments 


To illustrate the robustness of estimating the hyper-parameters of a Gaussian process using the parallel 
AIMS-OPT framework, three test cases have been selected. The first two are common examples that can be 
found in the literature. The first is known as the Branin function and has been modified to resemble usual 
properties of engineering applications [Forrester et ah, 2008]. The second one [Bastos and O’Hagan, 2009] has 
been used as a two dimensional function with a challenging complexity for emulating purposes. The third 
example presented in this section comes from a real dataset also presented in Bastos and O’Hagan [2009]. In 
all the examples it is assumed that h(x) = (l,^!, ... ,Xp)^. Regarding the nugget, a sigmoid transformation 
has been performed in order to sample from a Gaussian distribution. Namely, we sample an auxiliary zs as 
part of the multivariate Gaussian in (3.19), and compute the nugget as 


Os 


1 — lb 

1 -I- exp(-z5) 


+ 4 , 


(5.1) 


where lb is the lower bound for the nugget, which is set equal to 10“^^. Additionally, the uniform meta-prior 
distribution of equation (3.4) has been considered in a practical support of the length-scale parameters in the 
logarithmic space, namely a uniform distribution with support in [—7, 7]. For the nugget, a truncated beta 
distribution with parameters a = P = 0.5 has been considered since it corresponds to a non informative meta¬ 
prior in the interval [lb, 1]. Here the prefix meta has been used to refer to the algorithm’s prior distribution 
and to set a clear distinction from the prior used in the modelling assumptions in equation (2.13). 

The code has been implemented in MATLAB and all examples have been run in a GNU/Linux machine 
with an Intel i5 processor with 8 Gb of RAM. For the purpose of reproducibility, the code used to generate 
the examples is available for download at https://github.com/agarbuno/paims_codes. 


5.1. Branin Function 

The version of the Branin function used in this paper is a modification made by Forrester et al. [2008] 
for the purpose of Kriging prediction in engineering applications. It is a rescaled version of the original in 
order to bound the inputs to the rectangle [0,1] x [0,1], with an additional term that modifies its landscape 
to include a global optimum. Namely, 

/(x) = ^^2 - + “^1 “6^ +10 

where xi = 15x1 — 5 and X2 = 15x2. 

For this case, a sample of 18 design points were chosen with a Latin hypercube sampling scheme. The 
resulting log-posterior function possesses 4 different modes in its landscape (see Figure 4(a)) leading to 4 
possible configurations of the correlation function. Thus, the impact of the training runs used to construct the 
emulator is evident. Among these modes, 4 different types of emulators can be distinguished: an emulator with 


1 - ^ ) cos(xi) -f 1 


5xi, 


(5.2) 
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high sensitivity to changes in input xi (mode A in Figure 4(a)); an emulator with rapid changes in X 2 for the 
correlation structure of the training runs (mode B); a limiting case where dimension X 2 is disregarded in the 
correlation function, due to a high value in 02 (mode C); or a second limiting emulator which approximates a 
Bayesian linear regression model (mode D) [see Andrianakis and Challenor, 2012, for a detailed discussion]. 



(a) Level curves and modes 


(b) Parallel AIMS-OPT sample 


(c) Residual plots 


Figure 4: Projection of the negative log-posterior curves in the two dimensional length-scale space for the Branin 
simulator. The minimum possible value of 10“^^ for the nugget (pg has been used for such projection. The 
reference diagonal helps visualise the regions where the length scales favour one dimension over the other. 


For this example, two thousand samples were generated in each annealing level. The parallel AIMS-OPT 
algorithm generated 7 annealing levels to produce the samples in Figure 4(b). The RMSE of the MAP model 
is 7.068 whereas the RMSE of the mixture is 15.099 which is an indication that in terms of brute prediction, 
the mixture model could be improved by taking more samples. Figure 4(c) depicts the standardised residuals 
from both the MAP approach (top) and the mixture model (bottom) using equations (2.10) and (2.11) with 
uniform weights in the sample. The standardised residuals are defined as 


r(x) 


y - m(x) 


(5.3) 


where y is the output for configuration x, /r(x) = A[j/|x,I?] and cr^(x) = var(?/|x,I?), the posterior mean and 
variance for configuration x [see Bastos and O’Hagan, 2009, for a more detailed discussion on diagnostics]. 
By marginalising the hyper-parameters it is clear that our estimation is a more robust in terms of error 
prediction. Even with such limited amount of information the residuals suggest that the uncertainty is 
being incorporated appropriately in the marginalised predictive posterior distribution in equation (2.6). The 
standardised residuals are inside the 95% confidence bands, assuming approximate normality, though not too 
close to 0. This is an indicator that although greater variability is expected, excessively large variances are 
avoided. This is done by means of the integrated predictive distribution and the use of the proposed sampler 
to build a mixture of emulators leaving the predicted errors inside appropriate bounds. 


5.2. 2D Model 

This function has already been used as an example for emulation purposes and can be found in GEM- 
SA software web page (http;//ctcd.group, shef.ac.uk/gem.html). Even though it is a two dimensional 
problem it also serves as a good illustration of the importance of estimating the hyper-parameters of a 
Gaussian process with a multi-modal sampler. The mathematical expression for this simulator is 


/(x) 


1 — exp 



/2300a;? -b 1900a;? -b 2092a;i -b 60\ 
100a;? -b 500a;? -b 4a;i -b 20 ) ' 


(5.4) 
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As in the previous case, the training runs and the modelling assumptions fail to summarise the uncertainty 
in a uni-modal posterior distribution. The design points where selected using a Latin hypercube in the 
rectangle [0,1] x [0,1]. It can be seen from Figure 5(a) that the modes are separated by a wide valley of low 
posterior probability, which can become an overwhelming task for traditional MCMC samplers. The proposed 
sampler is able to cope with all local and global spread dynamics present in the neighbourhoods of the modes 
it encounters, as shown in Figure 5(b). 



Number of training r uns = 20. Temperature = 0.26 


10 '^ 10 ° 10 ^ 

(b) Parallel AIMS-OPT sample 


Number of training runs = 20 


10 '® 10 ° 10 ^ 

(a) Level curves with reference prior 


Number of training runs = 20 



(c) Level curves with uniform prior 


0 5 10 15 20 25 

Residual plot using mixture model 


5 10 15 20 

Index 

(d) Residuals plot 


Residual plot using MAP 

10 r 


Figure 5: Projection of the negative log-posterior curves in the two dimensional length-scale space for the 2D Model 
simulator. The minimum possible value of 10“^^ for the nugget ipg has been used for such projection. The 
reference diagonal helps visualise the regions where the length scales favour one dimension over the other. 


Depicted in Figures 5(a) and 5(c) the use of the reference prior in the posterior distribution removes 
probability mass from the neighbourhood around the origin. This validates the use of the reference prior 
to cut out regions from the space of hyper-parameters for the sampling and exploit the most information 
contained in the data available, namely, the training runs D. As in the previous example, two thousand 
samples were generated in each annealing level. The parallel AIMS-OPT algorithm generated 7 annealing 
levels to produce the samples in Figure 5(b). In terms of prediction accuracy, we now obtain that the RMSE 
is 1.356 for the MAP estimate and 1.345 for the mixture model. While as for the residuals, we can see from 
Figure 5(d) that the mixture model allows for a more robust prediction of the error, by means of increasing 
the variability in particular locations. This can be seen as the standardised residuals are concentrated within 
the 95% confidence bands of an approximate assumed normality, resulting in a more robust estimation of the 
error by the use of a mixture model. This motivates the use of multi-modal density samplers in the context of 
optimisation, where if a single candidate is provided the overall error prediction of the emulator might be 
biased towards more concentrated predictions around the mean estimation. 
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5.3. Nilson-Kuusk Model 

This simulator is built from the Nilson-Kuusk model for the reflectance for a homogeneous plant canopy. 
Such model is a five dimensional simulator whose inputs are the solar zenith angle, the leaf area index, relative 
leaf size, the Markov clumping parameter and a model parameter A [see Nilson and Kuusk, 1989, for further 
details on the model itself and the meaning of the inputs and outputs]. For the analysis presented in this paper 
a single output emulator is assumed and the set of the inputs have been rescaled to fit the hyper-rectangle 
[0,1]® on a five dimensional space as in Bastos and O’Hagan [2009]. 

As in the previous test cases, the design points were chosen by Latin hypercube designs (100 for this 
case). In this example, the dimension of the problem makes it impossible to plot the level curves of the 
posterior distribution for the length scale hyper-parameters to visualize potential multiple modes. However, 
the samples can be visualized by means of a box-plot as shown in Figure 6, where the red line denotes the 
median, the edges of the box the 25**' and 75*"^ percentiles, and the whiskers cover the most extreme cases. 
The samples are obtained after completing 10 levels of the parallel AIMS-OPT algorithm. The box-plots 
of the approximate optimal solutions strongly suggest that the samples come from a multi-modal posterior 
distribution. This can be seen from the location of the edges of the boxes and the median for any given input. 
The last input possesses a very limited spread which might denote a high concentration around one mode. 
Note that as the magnitude of the length-scale increases, thus reducing the sensitivity of the simulator to such 
input, the length-scales are located in what can be seen as either a plateau or regions of modes with negligible 
difference in the posterior density. Additionally, from the range of values that are covered in log-space, it can 
be noted that the output of the simulator appears to be insensitive to changes of the third and fourth input. 
Furthermore, a limit-case emulator can be suggested by the box-plot in Figure 6 by considering a surrogate 
with no third and fourth inputs in the model. Notice the scales for such hyper-parameters in logarithmic 
space. 



01 02 03 04 05 


Figure 6: Box-plots of the sample of length-scales obtained by the parallel asymptotically independent Markov sampling. 


Due to the larger number of dimensions, five thousand samples were generated for each annealing level. In 
this case we have that the RMSE of the MAP estimate is 0.022 while the RMSE of the mixture proposal 
is 0.021 which is a consequence of the posterior distribution being highly concentrated around one mode, 
in a particular set of length-scales {(pi, <p 2 and p^) while being less specialised for the less sensitive ones {p 3 
and pi). In Eigure 7 there is evidence that even with such behaviour the predictive error is improved by 
narrowing the spread of the standardised residuals, as before, a consequence of an increased estimation of 
the variability in particular locations. In this case the residuals cannot all be contained in the approximate 
normality 95% bands but as noted by Bastos and O’Hagan [2009] in their experiments there is strong evidence 
that more runs of the simulator are needed to adequately built a statistical surrogate. Due to the highly 
concentrated posterior density around the high sensitive length-scales there seems to be no apparent gain 
from using the mixture model. However, it can be noted from Figure 6 that by acknowledging the variability 
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of the hyper-parameters, a better understanding of the sensitivity of the simulator with respect to the inputs 
is achieved. An improved and more robust uncertainty analysis of the simulator can be provided in this 
case understanding the wide spread of length-scales for particular dimensions. For instance if screening is 
performed, the MAP estimate will fail to summarise the wide posterior density with respect to 4>^ and 
and this in turn, will provide partial information. This analysis cannot be performed solely by maximising 
the posterior density. Therefore the proposed method provides additional insight of the sensitivity of both 
simulator and emulator. 


Residual plot using MAP 
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Residual plot using mixture model 
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Figure 7: Residuals plot for the Nilson-Kuusk simulator 


6. Conclusions 

This paper proposes to estimate the hyper-parameters of a Gaussian process using a new sampler based 
on the Asymptotically Independent Markov Sampling (AIMS) method. The AIMS-OPT algorithm, used in 
stochastic optimisation, provides a robust computation of the MAP estimates of the hyper-parameters. This 
is done by providing a set of approximations to the optimal solution instead of a single approximation as it is 
so frequently done in the literature. The problem is approached in a combined effort from the computational, 
optimisation and probabilistic perspectives which serve as solid foundations for building surrogate models for 
computationally expensive computer codes. 

The original AIMS algorithm has been extended to provide an efficient sampler in computational terms, 
by means of parallelisation, as well as an effective sampler with good mixing qualities, by means of both the 
delayed rejection and adaptive modification exposed. It has been demonstrated that by using the parallel 
AIMS-OPT algorithm it is possible to acknowledge uncertainty in the structure of the emulator proposed 
as illustrated in the examples provided. Structural uncertainty should be taken into account to determine 
when the training runs available are sufficient to narrow the posterior distribution of the hyper-parameters 
to a uni-modal convex distribution. Even though it has been proven to be effective in lower and medium 
dimensional design spaces, research in high dimensional spaces has been left for future research. 
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Appendix A. 

In this appendix, a proof that using the delayed rejection algorithm in the AIMS framework leaves the 
target distribution pk{’) invariant is provided. 
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A sufficient condition to prove that indeed Pk{ ) is the stationary distribution for the Markov chain is to 
prove that the detailed balance condition is satisfied. Since the first stage approval has been proven to satisfy 
the detailed balance condition in Zuev and Beck [2013], it will only be proved for the second stage sampling. 

Let fk{4>2\4>o) describe the AIMS-OPT delayed transitions in the fc-th annealing level from —)■ </> 2 , with 
02 7^ 00- 4>i be the rejected transition in the first stage, for any 0g, 0]^, 02 G ..., 0^^“^^}. It 

will be proved that for such candidates the following holds: 

13fc(^o)/2(02l^o) =Pk{(t}2)f2{(f>o\<p2)- (A-l) 

As seen from the description in section 3.4 it follows that 



generate cf)^^ reject cf)^ generate <p2 accept <p2 


(A.2) 


where it is used the fact that AIMS-OPT generates first stage proposals with an independent approximate 
distribution. Recall that the probability of a second stage proposal is 


02(00,^2) = min 


^ Pfc(02) >5'2(0 o|02, 0i) (1 - ai(02>l)) ] 

Pfe(0o) *5'2(02|0O) ^l) (1 ~ <3:1(00, 0i)) / 


(A.3) 


and the fact that for any two positive numbers a,b the equality a min{l,6/a} = b min{l,a/6} is satisfied. 
With these two equalities we can substitute the left hand side of equation (A.l) as 


Pfc(0o)/2(02l^o) = Pfc.n( 0 l) [P/c(</'o) 5'2(02|0O> </’l) (1 “ ai(0O> ‘/’l))] «2(0O, ^ 2 ) 

— Pk,ni4’l) [Pfc(</’2) *^2(001^21 (i- ~ ®l(^2i *^l))] ®2(02 j^o) 

= Pfc(02) /2(0 oI02)> (A.4) 


which proves the detailed balance for the second stage proposal. Note that the proof has been made with no 
further assumptions about the second stage proposal distribution 5'2(02|0o, 0i), ii' ‘^an be defined from 
several candidates. In this work, a symmetric proposal that ignores the rejected sample has been used since it 
can be interpreted as a Random Walk safeguard against a possible ill approximation done by the independent 
sampler. 
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